
Add sublayer compute function and example project for dense #62

Merged: 7 commits merged into master from jmgd/sublayer on Jul 10, 2018

Conversation

@jmduarte (Member) commented May 25, 2018

This is a PR to fix the memory problem (issue #59) when unrolling large loops.

The idea is to break up the loop by partitioning the output array for each layer call.

This PR only addresses the fully connected layer.
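
For illustration, here is a minimal sketch of the partitioning idea (the function, pragma placement, and template names are hypothetical, not the actual hls4ml code): each sublayer call produces one contiguous slice of the layer's output array, so the compiler only has to unroll the loops for that slice.

// Hypothetical sketch: compute only a slice of a dense layer's output.
// CONFIG_T::n_part is the slice size; offset says where the slice starts.
template<class data_T, class res_T, typename CONFIG_T>
void compute_sublayer(
    data_T  data[CONFIG_T::n_in],
    res_T   res_part[CONFIG_T::n_part],
    typename CONFIG_T::weight_t weights[CONFIG_T::n_in * CONFIG_T::n_out],
    typename CONFIG_T::bias_t   biases[CONFIG_T::n_out],
    int offset)
{
    for (int jj = 0; jj < CONFIG_T::n_part; jj++) {
        #pragma HLS UNROLL
        typename CONFIG_T::accum_t acc = biases[offset + jj];
        for (int ii = 0; ii < CONFIG_T::n_in; ii++) {
            #pragma HLS UNROLL
            acc += data[ii] * weights[ii * CONFIG_T::n_out + offset + jj];
        }
        res_part[jj] = (res_T) acc;
    }
}

The full output is then assembled by calling this once per slice and concatenating the partial results.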

@nhanvtran (Contributor)

This looks like a great start. I did a quick check and the results are similar between the sublayer and full-layer computations -- but not exactly the same. I guess you have to store the intermediate values and waste FFs.

Two next thoughts:

compute_layer() {

    // allocate multiplier resources
    #pragma HLS ALLOCATION instances=mul limit=multiplier_limit operation

    for (int i = 0; i < n_sublayers; i++) {
        compute_sublayer(i);
    }
    merge_sublayers();
}

@benjaminkreis (Member)

Regarding the point on pruning, that is worth a try. We could also switch to calculating the number of nonzero multiplications on the fly, like we do for the convolutional layer: https://github.com/hls-fpga-machine-learning/hls4ml/blob/master/nnet_utils/nnet_conv.h#L108-L109
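
For reference, inside the dense layer's compute function that accounting would look roughly like the fragment below, mirroring the nnet_conv.h lines linked above; a CONFIG_T::n_zeros field would have to be added to the dense config, so treat the names as assumptions rather than existing code.

// Budget multipliers from the nonzero-weight count, as nnet_conv.h does:
// pruned (zero) weights never need a multiplier, so they are subtracted
// from the limit handed to the ALLOCATION pragma. Requires <cmath>.
const int multiplier_limit =
    ceil(float(CONFIG_T::n_in * CONFIG_T::n_out) / float(CONFIG_T::reuse_factor))
    - floor(float(CONFIG_T::n_zeros) / float(CONFIG_T::reuse_factor));
#pragma HLS ALLOCATION instances=mul limit=multiplier_limit operation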

@nhanvtran (Contributor)

^^^ this

Maybe it's good to develop consistent machinery between conv and mlp?

@jmduarte (Member, Author)

@nhanvtran we never replied to your idea about doing loops within loops.

My feeling is that by doing the separate sublayer calls within a loop, you'll end up with the same problem (i.e. it's going to try to unroll everything).

This is why I imagined having the hls_writer just write the sublayer calls sequentially. I almost have the update to hls_writer done, so I'll update the PR and you can see how it will work for more generic dense networks.
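
Roughly, the generated myproject.cpp would then contain a run of calls like the following for each large dense layer (the variable names and the merge helper here are illustrative, not the exact generated code):

// Illustrative excerpt of sequentially written sublayer calls for one
// dense layer split into two pieces; a merge step reassembles the output.
layer2_t layer2_out_0[N_LAYER_2 / 2];
layer2_t layer2_out_1[N_LAYER_2 / 2];
layer2_t layer2_out[N_LAYER_2];

nnet::compute_sublayer<layer1_t, layer2_t, config2>(layer1_out, layer2_out_0, w2, b2, 0);
nnet::compute_sublayer<layer1_t, layer2_t, config2>(layer1_out, layer2_out_1, w2, b2, N_LAYER_2 / 2);
nnet::merge<layer2_t, N_LAYER_2 / 2, N_LAYER_2 / 2>(layer2_out_0, layer2_out_1, layer2_out);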

@jmduarte (Member, Author) commented Jun 11, 2018

You can test this hls_writer support with the following config (for example):

KerasJson: example-keras-model-files/KERAS_dense_big.json
KerasH5:   example-keras-model-files/KERAS_dense_big_weights.h5
OutputDir: my-hls-test-sublayer
ProjectName: myproject
XilinxPart:  xcku115-flvf1924-2-i
ClockPeriod: 5

IOType: io_parallel # options: io_serial/io_parallel
ReuseFactor: 50
DefaultPrecision: ap_fixed<16,6> 

The model is a big dense model: https://github.com/hls-fpga-machine-learning/keras-training/blob/master/models/models.py#L7

@jmduarte changed the title from "[WIP] Add sublayer compute function and example project" to "Add sublayer compute function and example project for dense" on Jun 12, 2018
@nhanvtran (Contributor)

@jmduarte writing the sublayers sequentially also works

@nhanvtran (Contributor)

So it looks like it's working well, but I'm a little concerned about how this looks to the user. Is there a way to "wrap" all the sublayer calls so they're not in the main function of the HLS project? Similarly (and probably a little more importantly), there are so many sublayer configurations that fine-tuning beyond the yaml configuration looks intractable. What do you think?

@jmduarte (Member, Author) commented Jul 5, 2018

@nhanvtran, the latest commit addresses your comment about the aesthetics.

I think the code looks more straightforward with the many sublayer calls factorized into their own functions at the bottom of the myproject.cpp top file.
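
Schematically, and again with made-up names rather than the exact generated code, the top function now contains one call per layer while the wrapper holding the sublayer calls sits at the bottom of the file:

// In the top function body the big dense layer is now a single call
// (a prototype above myproject() makes the wrapper visible here):
compute_layer2(layer1_out, layer2_out, w2, b2);

// ...and the wrapper at the bottom of myproject.cpp holds the sublayer
// calls and the merge, following the same pattern as the earlier sketch:
void compute_layer2(layer1_t layer1_out[N_LAYER_1],
                    layer2_t layer2_out[N_LAYER_2],
                    weight2_t w2[N_LAYER_1 * N_LAYER_2],
                    bias2_t  b2[N_LAYER_2]) {
    layer2_t layer2_out_0[N_LAYER_2 / 2];
    layer2_t layer2_out_1[N_LAYER_2 / 2];
    nnet::compute_sublayer<layer1_t, layer2_t, config2>(layer1_out, layer2_out_0, w2, b2, 0);
    nnet::compute_sublayer<layer1_t, layer2_t, config2>(layer1_out, layer2_out_1, w2, b2, N_LAYER_2 / 2);
    nnet::merge<layer2_t, N_LAYER_2 / 2, N_LAYER_2 / 2>(layer2_out_0, layer2_out_1, layer2_out);
}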

Take a look (you can run the config referenced above) and let me know if it's ok.

Thanks,
Javier

@nhanvtran (Contributor)

tested and will merge so that we can proceed to conv sublayers

@nhanvtran merged commit c3da0e7 into master on Jul 10, 2018
@violatingcp pushed a commit that referenced this pull request on Feb 10, 2019: "Add sublayer compute function and example project for dense"
@jmduarte deleted the jmgd/sublayer branch on August 4, 2021
@ddddavid-he commented Jan 9, 2023

ERROR: [XFORM 203-504] Stop unrolling loop 'Product1' (firmware/nnet_utils/nnet_dense_latency.h:85) in function 'nnet::dense_latency<ap_fixed<16, 6, (ap_q_mode)5, (ap_o_mode)3, 0>, ap_fixed<16, 6, (ap_q_mode)5, (ap_o_mode)3, 0>, config11>' because it may cause large runtime and excessive memory usage due to increase in code size. Please avoid unrolling the loop or form sub-functions for code in the loop body.

It seems that the problem still exists with the new writer.

I am converting a fairly large CNN model, like this:

# imports added for completeness (tf.keras assumed)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPool1D, Flatten, Dense

model = Sequential([
    Conv1D(filters=5, kernel_size=5, strides=2, activation='relu'),
    MaxPool1D(pool_size=5, strides=3),
    Conv1D(filters=10, kernel_size=5, strides=2, activation='relu'),
    MaxPool1D(pool_size=5, strides=3),
    Conv1D(filters=20, kernel_size=5, strides=2, activation='relu'),
    Flatten(),
    Dense(120, input_shape=(20*15,), activation='relu'),
    Dense(64, input_shape=(120,), activation='relu'),
    Dense(2, input_shape=(64,), activation=None)
])

And the problem seems to occur in the dense layer. Is there any solution?

@calad0i pushed a commit to calad0i/hls4ml that referenced this pull request on Jul 1, 2023: "…ing/jmgd/sublayer" (Add sublayer compute function and example project for dense)
@marina-neseem

I am facing the same issue. I am converting a basic CNN.

# imports added for completeness (tf.keras assumed)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Flatten, Dense

model = Sequential()
model.add(Conv2D(64, kernel_size=3, activation='relu', input_shape=(28,28,1)))
model.add(Conv2D(32, kernel_size=3, activation='relu'))
model.add(Flatten())
model.add(Dense(10, activation='softmax'))

I get the same error

ERROR: [XFORM 203-504] Stop unrolling loop 'Product1' (firmware/nnet_utils/nnet_dense_latency.h:37) in function 'nnet::conv_2d_cl<nnet::array<ap_fixed<16, 6, (ap_q_mode)5, (ap_o_mode)3, 0>, 64u>, nnet::array<ap_fixed<16, 6, (ap_q_mode)5, (ap_o_mode)3, 0>, 32u>, config4>' because it may cause large runtime and excessive memory usage due to increase in code size. Please avoid unrolling the loop or form sub-functions for code in the loop body.
ERROR: [HLS 200-70] Pre-synthesis failed.
command 'ap_source' returned error code
    while executing
"source build_prj.tcl"
    ("uplevel" body line 1)
    invoked from within
"uplevel \#0 [list source $arg] "

Is there any solution for it?

@jmduarte (Member, Author) commented Aug 1, 2023

Hi @marina-neseem, this just means you're trying to fully parallelize/unroll the CNN operations (e.g. by using io_parallel), and you're hitting a limitation built into the HLS compiler.

There is another dataflow scheme in hls4ml called io_stream that we typically recommend for CNNs.

See https://fastmachinelearning.org/hls4ml/details.html#i-o-types
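
For reference, in a yaml config like the one shown earlier in this thread, that corresponds to changing the IOType line:

IOType: io_stream # instead of io_parallel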

@layson-inventor

I tried setting io_type='io_stream', but I get the same error.
